PyL3dMD: Python LAMMPS 3D molecular descriptors package

Molecular descriptors characterize the biological, physical, and chemical properties of molecules and have long been used for understanding molecular interactions and facilitating materials design. Some of the most robust descriptors are derived from geometrical representations of molecules, called 3-dimensional (3D) descriptors. When calculated from molecular dynamics (MD) simulation trajectories, 3D descriptors can also capture the effects of operating conditions such as temperature or pressure. However, extracting 3D descriptors from MD trajectories is non-trivial, which hinders their wide use by researchers developing advanced quantitative-structure–property-relationship models using machine learning. Here, we describe a suite of open-source Python-based post-processing routines, called PyL3dMD, for calculating 3D descriptors from MD simulations. PyL3dMD is compatible with the popular simulation package LAMMPS and enables users to compute more than 2000 3D molecular descriptors from atomic trajectories generated by MD simulations. PyL3dMD is freely available via GitHub and can be easily installed and used as a highly flexible Python package on all major platforms (Windows, Linux, and macOS). A performance benchmark study used descriptors calculated by PyL3dMD to develop a neural network and the results showed that PyL3dMD is fast and efficient in calculating descriptors for large and complex molecular systems with long simulation durations. PyL3dMD facilitates the calculation of 3D molecular descriptors using MD simulations, making it a valuable tool for cheminformatics studies. Graphical Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s13321-023-00737-5.


Introduction
Molecular descriptors characterize the biological, physical, and chemical properties of molecules and have long been used for understanding molecular interactions and facilitating materials design. Some of the most robust descriptors are derived from geometrical representations of molecules, called geometric or 3-dimensional (3D) descriptors. When calculated from molecular dynamics (MD) simulation trajectories, 3D descriptors can also capture the effects of operating conditions such as temperature or pressure. However, extracting 3D descriptors from MD trajectories is non-trivial, which hinders their wide use by researchers developing advanced quantitative-structure-property-relationship models using machine learning. Here, we describe a suite of open-source Python-based postprocessing routines, called PyL3dMD, for calculating 3D descriptors from MD simulations. PyL3dMD is compatible with the popular simulation package LAMMPS and enables users to compute more than 2000 3D molecular descriptors from atomic trajectories generated by MD simulations.  The detailed list of descriptors with their description can be found in the excel worksheet on the GitHub: https://github.com/panwarp/PyL3dMD.

Installation Instructions
For simplification purposes, users should use Anaconda (https:// www.anaconda.com) to install Python and packages.
Install on a Windows or iOS system: 1. From here, download the Anaconda for the Windows or iOS operating system. 2. Install Anaconda by double clicking on the Anaconda installer ".exe" file and following the prompt on the installer. 3. The package has been uploaded to PyPi (https://pypi.org/project/PyL3dMD/). Users can install the package using the command "pip install pyl3dmd". i.
Run "Anaconda Prompt" as the administrator. ii.
Press "Enter" on the keyboard to start the installation. iv.
The installation will finish in seconds.
Install on the Linux Platform: Since the PyL3dMD package supports parallel computing using multiple CPUs, users can accelerate the descriptor calculation using a highperformance computer (HPC) cluster. The installation on an HPC is similar: To uninstall PyL3dMD, users can use command "pip uninstall pyl3dmd".

Example Usage of the Package
To make getting started easier, we provided some sample files. These files contain the usage of the package on a local computer and an HPC found on GitHub page at https://github.com/panwarp/PyL3dMD.
Here is the structure of the files, where there are two folders (one for local computer and the other for HPC): This Python file is ready to be run to calculate the 'set1' descriptors based on the provided data and trajectory files using 16 CPU processors for parallel computation. However, if users wish to calculate descriptors for their own models, the names and paths of their own data and trajectory files should be provided in sample.py. Besides, which descriptors to be calculated and the number of CPU processors can be customed by users. Generally, more processors will accelerate the calculation. If optional inputs are not provided, PyL3dMD will calculate all the descriptors using the maximum available processors in a system.
After running this Python code, multiple .csv files (one file for each molecule) containing the descriptors of the model will be generated.

Property Descriptors
Density -The simulation-calculated density is calculated by the ratio of total mass ( ) of atoms in the simulation box and the volume ( ) of simulation box at an operating condition:

Topology and Connectivity Descriptors
Adjacency Matrix -The adjacency matrix is a square symmetric of size × , where is the number of vertices (atoms). The entries of the adjacency matrix are equal one if vertices v and v are adjacent (i.e., the atoms and are bonded), and zero otherwise. Mathematically [1,3], Where E(G) is the set of graph edges (bonds). The adjacency matrix is usually derived from the H-depleted molecular graph.
Distance matrix -The distance matrix is a square symmetric of size × , where is the number of vertices (atoms). It is sometimes also called 2D-distance matrix or topological distance matrix. The entries of the matrix are the topological distances between all the pairs of graph vertices. The topological distance is the number of edges (bonds) along the shortest path min between the vertices v and v . The diagonal entries the distance matrix equal zero whereas the off-diagonal entries equal one if vertices v and v are adjacent (that is, the atoms and are bonded and = = 1, where are elements of the adjacency matrix A) and are greater than one otherwise. Mathematically [1,3], The adjacency matrix encodes information about vertex connectivity and the distance matrix encodes information about relative locations of graph vertices [1,3].
Reciprocal Distance Matrix -The reciprocal distance matrix −1 is a square symmetric of size × , where is the number of vertices (atoms). The entries of reciprocal distance matrix are the inverse of entries of the topical distance matrix . Mathematically [1,3], Vertex Degree Matrix -The vertex degree matrix V is a square symmetric of size × , where is the number of vertices (atoms), whose diagonal entries are the vertex degrees δ . Mathematically [1,3], Molecular Matrix -The molecular matrix is a rectangular matrix of size × 3 where is the number of vertices (atoms). The entries of are the atom spatial coordinates ( , , ) of the molecule. Mathematically [1,3], Geometrical Distance Matrix -The geometrical distance matrix ( ) is a square symmetric of size × , where is the number of vertices (atoms). It is sometimes called 3D-distance matrix or Euclidean distance matrix. Each entry of the geometrical distance matrix is the Euclidean distance (intramolecular interatomic distance) between the atoms and , that is, between each pair of atoms of a molecule. Mathematically [1,3], The geometry matrix does not contain information about atom connectivity. Therefore, it is accompanied by a connectivity table (list of all bonded atoms).
Reciprocal Geometry Matrix -The reciprocal geometry matrix −1 is obtained by inverting the Euclidean distance collected in the geometry matrix, as [1,3]: 3D-Adjacency Matrix -The 3D-adjacency matrix ( G ) is obtained from the geometry matrix as [1,3]: where ⊗ indicates the Hadamard matrix product (or pairwise multiplication) and A is the adjacency matrix. Thus, the elements of the 3D-adjacency matrix are the bond lengths for pairs of bonded atoms, and zero otherwise [1,3].
Reciprocal squared geometrical distance matrix -The Reciprocal squared geometrical distance matrix ( ) is a square symmetric matrix containing the inverse of the squared Euclidean distances, taken pairwise, between the atoms of a molecule. Mathematically [1,3], is a square and symmetric matrix containing the ratios between the Euclidean distance and the topological distance, between each pair of atoms of a molecule. Mathematically [1,3], The equation of topology and connectivity descriptors are listed in the following section [1,3]: Geometric distance degree: Average geometric distance degree: 3D-Wiener index: 3D-connectivity indexes: Euclidean connectivity index: 3D-Balaban index: Variant of the geometric distance degree: Zagreb-like 3DM1 : Zagreb-like 3DM2 : Connectivity type 3D 0 X: Connectivity type 3D 1 X: Wiener-type indexes (3DWi ): Geometric eccentricity: = max ( ) (21) Geometric radius: 3D-Schultz index: 3D-MTI' index:

Geometric Descriptors
The equation of geometric descriptors is listed in the following section [1,3]: Molar Volume -The molar volume is defined as the ratio of the volume of a sample of that substance (expressed in liters, for example) to the amount of substance (usually expressed in moles) in the sample. It is experimentally measured as: Here, is the molecular weight and is the density of the liquid. The SI unit of molar volume is cubic meters per mole (m 3 /mol). Molecular Volume -The molecular volume is defined as the volume of the region within a molecule is constrained by its neighbors. It is calculated from density of the liquid as: Geometric Center -The geometric center of a molecule is the arithmetic mean of position of the atoms in a molecule [1,3].
It is calculated by averaging value of atom coordinates separately for each axis as: Variance -The distribution of atoms with respect to its center can be calculated by the variance as [1,3]: Here, is the position of atom. The square root of the variance is called the standard deviation ( ).
Pearson's first index -A measure of the asymmetry of the distribution of atoms can be calculated as [1,3]: where 3 the third power of its standard deviation. If 3 < 0 then a right-tailed distribution, and if 3 > 0, then a left-tailed distribution.
Kurtosis -A measure of the degree of bimodality of the distribution of atoms can be calculated as [1,3]: where 4 the fourth power of its standard deviation. = ∞ for a peak distribution, = 1 for a complete bimodal distribution, = 1.8 for uniform distributions, and = 3 for normal distribution.
Center of Mass -The center of mass of a molecule is the arithmetic mean of massposition of the atoms in a molecule. The center of mass of the molecule is calculated with the following formula [10,11]. It is calculated separately for each axis as: Here, is the atomic mass of th atom and = ∑ =1 .
End-to-end Distance -A simple size descriptor known as end-to-end distance of a molecule is defined as [10][11][12]: Here, is the interatomic distance between the first and the last atoms of the chain and ⃗ is the vector of the atom coordinates with respect to the center of mass.
Radius of Gyration -The radius of gyration is used to describe the dimensions of a molecule, more specifically, the distribution of atomic masses in a molecule. Therefore, it is a measure of molecular compactness. A small value is obtained when most of the atoms are close to the center of mass. The radius of gyration of a molecule is defined as [10,11], Here, = ∑ =1 and , respectively, are the total mass and center-of-mass position of the molecule and and , respectively, are mass and position of th atom.
Gravitational Indices -Geometrical descriptors reflecting the mass distribution in a molecule, defined as [1,3]: where and are the atomic masses of the considered atoms, , the corresponding interatomic distances, and A and B the number of atoms and bonds of the molecule, respectively. The 1 index considers all atom pairs in the molecule while the 2 index is restricted to pairs of bonded atoms. These indices are related to the bulk cohesiveness of the molecules, accounting, simultaneously, for both atomic masses (volumes) and their distribution within the molecular space. For modelling purposes, the square root and cube root of the gravitational indices were also proposed [1,3].
Mean Square Radius of Gyration -The mean square radius of gyration for the unperturbed chains can be derived from the Lagrange theorem: Hydrodynamic Radius -The hydrodynamic radius ℎ of a molecule is obtained from the mean reciprocal distance: Radius of Gyration Tensor -The gyration tensor of the conformations is defined as [10,11]

Eigenvalues of Gyration Tensor and Shape Parameters
Since the gyration tensor is a symmetric 3x3 matrix, a Cartesian coordinate system can be found in which it is diagonal (eigen) [10,11], Here, the axes are chosen such that the diagonal elements are ordered ≤ ≤ . These diagonal elements are called the principal moments of the gyration tensor [10,11]. The squared radius of gyration is the sum of the principal moments, i.e., 2 = + + [10,11].
Normalized Principal Gyration Ratios -Normalization can be performed by dividing the two lower eigenvalues of gyration tensor ( 1 and 2 ) by the highest value ( 3 ), generating two characteristic values of normalized principal gyration ratios (NGRs) for a molecule ( 1 and 2 ) as: The three parameters that describe the shape can determined by the principal moments of the gyration tensor as [10,11], Acylindricity -The acylindricity is always non-negative and zero only when the two principal moments are equal. The condition, = 0 is met when the distribution of particles is cylindrically symmetric which can also be true when the particle distribution is symmetric with respect to the two coordinate axes, e.g., when the particles are distributed uniformly on a regular prism.
Asphericity -It measures the deviation from spherical symmetry. The asphericity is always non-negative and zero only when the three principal moments are equal. The condition, = 0 is met when the distribution of particles is spherically symmetric which can also be true when the particles are distributed uniformly on a cube, tetrahedron or other Platonic solid.  Here, is the atom number, and and are the atomic mass and the perpendicular distance from the chosen axis of the th atom of the molecule, respectively. For any rectangular coordinate system, the inertia tensor can be given as [ The unit of gyration tensor is distance 2 where the unit of inertia tensor is mass×distance 2 . Although they have different units, the gyration tensor is related to the moment of inertia tensor. The key difference is that the particle positions are weighted by mass in the inertia tensor, whereas the gyration tensor depends only on the particle positions; mass plays no role in defining the gyration tensor [1].
Eigenvalues of Inertia Tensor and Shape Parameters [1,3] The eigenvalues of inertia tensor are called the principal moments of inertia. Principal moments of inertia are the moments of inertia corresponding to that particular and unique orientation of the axes for which one of the three moments has a maximum value, another a minimum value, and the third is either equal to one or the other or is intermediate in value between the other two. The corresponding axes are called principal axes of a molecule (or principal inertia axes). Moreover, the products of inertia all reduce to zero and the corresponding inertia matrix is diagonal. Conventionally, principal moments of inertia are labeled as: ≤ ≤ .
In general, the three principal moments of inertia have different values, but, depending on the molecular symmetry, they show characteristic equalities such as those shown in Table 2.
Normalization eliminates the dependency of the chosen representation on the size of the molecules under investigation. Furthermore, due to the intrinsic characteristics of the inertia tensor, the following relation will be fulfilled: Inertial shape factor -This is a shape factor based on the principal moments of inertia and defined as [1,3]: Here, 1 , 2 , and 3 are the principal moments of inertia].
Molecular eccentricity -It is a shape descriptor obtained from the eigenvalues of the inertia matrix defined as [1,3]: where = 0 corresponds to spherical top molecules and = 1 to linear and planar molecules. It is a shape descriptor defined by analogy with the eccentricity of an ellipse, which is defined as [1,3]: where and are the lengths of the major and minor elliptical axes, respectively.
Spherosity index -An anisometry descriptor defined as a function of the eigenvalues [1,3]: Spherosity index varies from zero for flat molecules, such as benzene, to one for totally spherical molecules [1,3].
Linearity index -Based on a similar approach of the unweighted WHIM shape index , it is defined as [1,3]: where the inertia moments are calculated on the unweighted atom coordinates and is the molecular weight. High values are associated with small linear molecules (4-7), whereas small values are associated with highly branched nonlinear molecules (0-0.5).
Characteristic ratio -A descriptor of average shape features of polymer. It can be considered as a measure of the degree of folding, defined as [1,3]: Here, 〈 2 〉 is the mean square radius of gyration averaged on all the conformations (or configurations), the number of bonds, and is the Kuhn length.
The characteristic ratio is also defined for the end-to-end distance , as [1,3]: Span A size descriptor defined as the radius of the smallest sphere, centered on the center of mass, completely enclosing all atoms of a molecule [1,3]: Here, is the distance of the th atom from the center of mass. The average span descriptor, calculated as the average value of conformational changes and denoted by ̅ is used to describe long chain molecules and is related to the Kuhn length.
Kuhn length -For long chain molecules, the Kuhn length is the mean of the bond distances, i.e., where is the number of bonds and is the bond distance between and bonded atoms [12].
Contour Length -The contour length is defined as: Plane of Best Fit Score -The average distance of all heavy atoms from the plane of best fit is proposed as a descriptor known as plane of best fit (PBF) score [14]. The coordinates of heavy atoms in the molecule are fitted to the plan with normalized equation The equation of best fit is then used to give the distance, Δ, of each heavy atom from the plane as The plane of best fit (PBF) score, has a theoretical range of [0, ∞). However, in practice, the PBF score tends to be below two for small drug-like molecules and below ten for proteins [14]. By normalizing the PBF score to the number of heavy atoms, it is applicable to any molecule irrespective of molecular size. The PBF score separates molecules closely clustered in NPR space, thereby allowing greater granularity of 3D shape characterization in molecular design and compound selection [14].
Molecular descriptors for distance/distance matrix [1,3]: Molecular profiles: {1 , 2 , 3 , 4 , 5 , 6 , … } (67) Here, Average distance/distance degree: D/D index: Folding degree index: Folding profile: -factor -It is a measure of branching which is calculated by the ratio of 2 for a given branched chain to that for the linear chain with the same molecular weight. It is also equal to the Wiener index normalized by the Wiener index of the linear chain.

Weighting Schemes
The following atomic properties are used as the atomic weightings for molecular descriptor calculation in PyL3dMD are: • Atomic properties are not always defined for the whole periodic table. Molecular descriptors based on a property that is not available for all atoms of a molecule will result in 'na'. Therefore, for united atom potential, all the weighted molecular descriptors will result in 'na'.
This weighing scheme is applied to GETAWAY, WHIM, RDF, ATS, MATS, GATS, and MoRSE descriptors.

GETAWAY Descriptors
ETAWAY stands for Geometry, topology, and atom-weights assembly descriptors that have been proposed as chemical structure descriptors derived from the molecular influence matrix , its diagonal elements ℎ called leverages, and influence/distance matrix . Leverages encode information related to the "influence" of each atom of a molecule in determining the whole shape of a molecule [1,3].
Molecular Influence Matrix -The molecular influence matrix is a square symmetric of size × , where is the number of vertices (atoms). It shows rotational invariance with respect to the molecule coordinates. It is calculated as [1]: where is the molecular matrix comprising the centered cartesian coordinates , , of the molecule atoms, including hydrogen, in a chosen conformation. Atomic coordinates are assumed to be calculated with respect to the geometrical center of the molecule to obtain translational invariance.
Influence/Distance Matrix -The influence/distance matrix is a square symmetric of size × , where is the number of vertices (atoms). It encodes spatial relationships between pairs of atoms of a molecule. The diagonal elements of the matrix are zero, while each offdiagonal element is calculated by the ratio of the geometric mean of the corresponding th and th diagonal elements of the matrix over the interatomic distance provided by the geometry matrix . Mathematically [1,3], Here, ℎ and ℎ are the leverages of the atoms and , and is their geometric distance. The square root product of the leverages of two atoms is divided by their interatomic distance to make less significant contributions from pairs of atoms far apart, according to the basic idea that interactions between atoms in the molecule decrease as their distance increases. Obviously, the largest values of the matrix elements are derived from the most external atoms (i.e., those with high leverages) and simultaneously next to each other in molecular space (i.e., those having small interatomic distances) [3].
Most of the GETAWAY descriptors are simply calculated only by the leverages used as the atomic weightings [1,3].
Geometric mean of the leverage magnitude: Total information content on the leverage equality: Standardized information content on the leverage equality Mean information content on the leverage magnitude: Average row sum of the influence/distance matrix: R-connectivity index: R-matrix leading eigenvalue: GETAWAY descriptors based on autocorrelation functions -The GETAWAY descriptors also comprised of autocorrelation vectors obtained by double-weighting the molecule atoms in such a way as to account for various atomic properties with 3D information encoded by the elements of the molecular influence matrix and influence/distance matrix [1,3].
HATS indexes: HATS total index: H indexes: H total index: R indexes: R total index: Maximal R indexes: Maximal R total index: Here, is the number of molecule atoms (hydrogen included), is the topological diameter; is the topological distance between atoms and , and is a physico-chemical property of the th atom.

WHIM Descriptors
WHIM stands for Weighted holistic invariant molecular descriptors. They are based on statistical indexes calculated by projecting atoms along principal axes. WHIM descriptors capture 3D information regarding molecular size, shape, symmetry, and atom distribution with respect to invariant reference frames. They are calculated by calculating eigenvalues and eigenvectors of a weighted covariance matrix of the centered cartesian coordinates of the atoms of a molecule. This weighted covariance matrix is given by [1,3] where is the weighted covariance between the th and th atomic coordinates, is the number of atoms, is the weight of the th atom, and represent the th and th coordinate ( , = , , ) of the th atom, respectively, and ̅ the corresponding average value.
Using this weighted covariance matrix, the following WHIM descriptors are calculated [ Here, refers to eigenvalues of the weighted covariance matrix; refers to atomic coordinates with respect to the principal axes; is the number of molecule atoms; is the number of symmetric atoms along a principal axis; and is the number of asymmetric atoms.

RDF Descriptors
Radial distribution function (RDF) descriptors were proposed based on an RDF that is quite often used for the interpretation of the diffraction patterns obtained in powder X-ray diffraction experiments. The RDF of an ensemble of atoms can be interpreted as the probability distribution to find an atom in a spherical volume of radius . The general form of the radial distribution function is represented [1,3] by where is a scaling factor, are characteristic atomic properties of the atoms and , is the interatomic distance between the th and th atom, is the radius of the spherical volume, and is the number of atoms. is the smoothing parameter which defines the probability distribution of the individual distances. PyL3dMD uses a equal to 100 Å −2 . PyL3dMD calculates RDF vector of 30 values starting from 1.0 Å up to 15.5 Å using a step size of 0.5 Å for .

3D-MoRSE Descriptors
3D-MoRSE stands for 3D-molecule representation of structures based on electron diffraction [15]. The 3D-MoRSE descriptors translate the 3D coordinates into a molecular code with a modified equation used in electron diffraction studies for preparing theoretical scattering curves [16,17].
Here, is the number of atoms, is one of the atomic properties used as the atomic weightings, is the scattering angle, and is the geometrical distance between the atoms and .
It is calculated for 30 evenly distributed values of in the range of 0.5-15.0 Å -1 from the 3D coordinates of the molecule. More details, including physical significance, of 3D-MoRSE descriptors can be found here [9]. These descriptors are usually calculated for molecules having 3D coordinates for all atoms, including hydrogen atoms.